Medical Decision Making
SAGE Publications
Preprints posted in the last 30 days, ranked by how well they match Medical Decision Making's content profile, based on 10 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.
Lounkaew, K.
National digital health platforms are scaling faster than the evidence on how to finance them. This paper develops a welfare-simulation framework that converts a published willingness-to-pay (WTP) distribution into a prescriptive pricing recommendation, applied to Thailand's KhunLook maternal-and-child-health application. Predicted WTP values at the 25th, 50th and 75th unconditional quantiles and the OLS mean -- drawn from a survey of n = 680 Thai parents and relatives of young children previously reported in Lounkaew et al. (2025) -- enter the simulation as parametric inputs. Quintile-level WTP is imputed by monotone-cubic interpolation, a population of 250,000 caregivers is drawn from truncated-Normal distributions around the quintile means, and five financing scenarios are compared: full public provision (S1), a flat market-priced fee (S2), freemium (S3), fine-grained income-tiered pricing (S4), and a means-tested subsidy with a flat fee for the top 60% (S5). A thematic reading of Thai digital-health policy documents bounds the institutionally feasible scenario set and anchors the interpretation of the simulation numbers. Full public provision maximises total welfare at 437.4 million THB but runs a five-year fiscal deficit. The means-tested subsidy gives up about 15% of that welfare to recover 198.6 million THB in net producer surplus, distributes consumer surplus toward lower-income quintiles (concentration index -0.258), and plugs into the existing Thai state welfare card register at near-zero marginal administrative cost. The ranking holds across all twelve sensitivity specifications. Administrative simplicity in subsidy targeting, read against the Thai WTP distribution, dominates finer-grained tiering on both welfare and equity grounds. The framework transfers cleanly to other middle-income countries deciding how to price a national digital health platform.
Author summary: Many middle-income-country governments now run free national smartphone apps for the health of mothers and young children, but the funding model is increasingly fragile as initial donor and research grants run out. The question this paper asks is simple: if such a platform had to start charging, what pricing structure would raise the most money without locking out the families who need the app most? Using a published Thai survey of 680 parents and relatives of young children, the paper simulates five alternative designs -- free, flat fee, freemium, fine-tiered by income quintile, and a means-tested subsidy -- and finds that offering the bottom 40% of households free access while charging the top 60% a flat 395 Thai baht per year (roughly USD 11) captures 85% of the welfare of the status-quo free model, generates 199 million baht of fiscal surplus over five years, and distributes benefits toward lower-income users rather than toward the well-off. The design works because Thailand's state welfare card register already identifies the low-income target population, so means-testing is essentially free to administer. Other countries with comparable social registries can apply the same logic to their own digital health platforms.
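The logic of the means-tested design can be illustrated with a minimal sketch. The 395 THB fee and the free bottom 40% come from the summary above; the WTP values, the function name, and the rule that a household subscribes whenever its WTP is at least the fee are hypothetical simplifications. The paper's truncated-Normal population draw and its cost side are not reproduced here.

```python
def means_tested_outcome(wtp_by_income, fee=395.0, exempt_share=0.40):
    """Return (consumer_surplus, revenue) for a means-tested flat fee.

    wtp_by_income: willingness-to-pay values (THB per year), sorted from the
    poorest household to the richest. Hypothetical input format.
    """
    n_exempt = round(len(wtp_by_income) * exempt_share)
    consumer_surplus = 0.0
    revenue = 0.0
    for rank, wtp in enumerate(wtp_by_income):
        if rank < n_exempt:
            # Exempt household: free access, so the full WTP is surplus.
            consumer_surplus += wtp
        elif wtp >= fee:
            # Paying household: subscribes and keeps WTP minus the fee.
            consumer_surplus += wtp - fee
            revenue += fee
        # Otherwise the household declines: no surplus, no revenue.
    return consumer_surplus, revenue


# Hypothetical WTP values for ten households, poorest first.
wtp = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0, 1000.0]
cs, rev = means_tested_outcome(wtp)
print(cs, rev)
```

With these made-up inputs, the four poorest households keep their full WTP as surplus and the six richest each pay the fee, so revenue is raised only from households that were not exempted.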
Lieberthal, R. D.; Buontempo, P.; Harmon, B.; Omosule, A.; Washabaugh, M.; Whittaker, A.
Background: Cell and gene therapies (CGT) represent a transformative class of medical interventions, yet their high production costs limit patient access. Understanding the structure of manufacturing costs is essential for informing policies that can expand access to these therapies. Objective: This study develops and applies a cost-of-goods-sold (COGS) model to analyze the contributors to manufacturing costs for mRNA-based CGT, with application to a wide range of current and future therapies. Methods: An Excel-based COGS model was constructed based on cost categories for CGT. Two mRNA-based products at commercial scale were used to populate the model: an mRNA vaccine and a therapeutic mRNA gene therapy. Cost inputs were drawn from vendor pricing, peer-reviewed and grey literature, and expert consultation with CGT manufacturing specialists. Three scenarios (worst, base, and best case) were modeled across six cost categories: materials, consumables, capital, labor, licenses, and royalties. A tornado diagram sensitivity analysis was conducted to identify key cost drivers. The mRNA vaccine was used to build and validate the model structure using publicly available data sources. The therapeutic mRNA therapy was used as the main use case for illustration and sensitivity analysis. Results: Under base-case assumptions, the estimated cost per dose for the therapeutic mRNA product is $56.09, ranging from $3.68 (best case) to $383.22 (worst case). Licensing and royalty fees together account for approximately 83% of total base-case COGS ($6,996,000 and $6,960,000 per production run, respectively, out of $16,825,597 total). Excluding these fees, material costs represent the largest remaining share (61%), followed by consumables (34%), capital (4%), and labor (1%). Sensitivity analysis confirms that licensing and royalty assumptions are the dominant source of uncertainty in the model.
Conclusions: Licensing and royalty fees are the primary driver of mRNA-based CGT production costs and represent the greatest opportunity for cost reduction through policy intervention. Strategic priorities for cost reduction should focus on optimizing reagent utilization, increasing platform potency, and expanding use of contract development and manufacturing organizations (CDMOs) to reduce capital and labor costs. Key Points: Producing an example mRNA gene therapy costs about $56 per dose to manufacture, driven almost entirely by fees paid to patent holders for the underlying technology. Licensing and royalty fees account for roughly 83 cents of every dollar spent on these new biopharmaceutical products. Until that changes, the gap between what therapies cost to make and what patients and payers are charged will remain very wide.
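The 83% share can be checked directly from the per-run figures quoted in the abstract. The sketch below uses only those reported numbers; the implied number of doses per run is a back-calculation from total COGS and cost per dose and is not stated in the abstract.

```python
# Base-case figures per production run, as reported in the abstract (USD).
licensing = 6_996_000
royalties = 6_960_000
total_cogs = 16_825_597
cost_per_dose = 56.09

# Share of total COGS attributable to licensing plus royalty fees.
fee_share = (licensing + royalties) / total_cogs

# Back-calculated batch size (an inference, not a reported value).
implied_doses_per_run = total_cogs / cost_per_dose

print(f"licensing + royalty share: {fee_share:.1%}")
print(f"implied doses per run: {implied_doses_per_run:,.0f}")
```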
Hinkel, J.; Modi, S.; Ray, A.; Brill, J.
Background: In-laboratory polysomnography (PSG) remains the diagnostic reference standard for sleep disorders but is resource-intensive and capacity-constrained. Limited-channel home sleep apnea testing (HSAT) improves access and reduces costs compared to in-laboratory polysomnography, but underestimates disease severity due to its inability to measure true sleep time and cannot identify non-respiratory sleep disorders including periodic limb movement disorder and parasomnias [1-5]. Comprehensive home polysomnography (hPSG) may preserve diagnostic fidelity while reducing system costs, improving access for patients unable to attend laboratory-based studies, and shortening time to diagnosis and therapy initiation. Objective: To estimate the short-term budget impact to a U.S. commercial health plan of substituting an appropriately selected proportion of in-laboratory PSG with comprehensive hPSG using the Onera Sleep Test System (STS). Methods: We developed a transparent budget impact model following ISPOR good practice guidelines for a hypothetical 1-million-member commercial plan. The model estimates the annual diagnostic population (top-of-funnel) using age- and sex-stratified prevalence, an undiagnosed fraction of 85%, symptom prevalence among undiagnosed individuals (30%), and an annual testing rate (12%) [2-3]. Baseline costs reflect current diagnostic pathways using HSAT (50% first-line) and in-laboratory PSG (50% first-line), including HSAT-to-PSG escalations (20%) and PSG repeats (4%). The intervention scenario substitutes a defined share of in-laboratory PSG and selected HSAT with Onera hPSG. Scenario and sensitivity analyses explore parameter uncertainty. Results: In the base case, approximately 4,364 individuals entered the OSA diagnostic workflow annually. Baseline diagnostic costs were estimated at $6.23 PMPM, comprising $5.45 million in PSG costs and $0.79 million in HSAT costs.
Introducing Onera hPSG (30% PSG replacement, 5% HSAT replacement in Year 1) reduced per member costs to $5.66 PMPM, yielding net savings of $0.57 PMPM ($567,262 annually). In Year 3 scenarios (60% PSG, 10% HSAT replacement), savings increased to $1.64 PMPM (approximately $1.64 million annually). Sensitivity analyses demonstrated net savings ranging from $0.03 to $8.05 PMPM, depending on adoption levels. Conclusions: Partial substitution of in-laboratory PSG with Onera hPSG may yield incremental budget savings for U.S. commercial payers while maintaining access to full polysomnographic assessment. Results support further payer-specific analyses incorporating real-world utilization and downstream outcomes. Keywords: obstructive sleep apnea; polysomnography; home sleep testing; budget impact analysis; health economics
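The reported per-member figures can be reproduced from the plan-level totals with simple arithmetic. In the sketch below all dollar amounts come from the abstract. The figures reconcile when the totals are divided by the 1,000,000 members without a further division by 12, so the variables are named as annual per-member amounts; that reading is an inference from the arithmetic, not something the abstract states.

```python
MEMBERS = 1_000_000          # hypothetical plan size used in the model

psg_cost = 5.45e6            # baseline in-laboratory PSG spend (USD)
hsat_cost = 0.79e6           # baseline HSAT spend (USD)
year1_savings = 567_262      # reported Year 1 net savings (USD)

baseline_total = psg_cost + hsat_cost
baseline_per_member = baseline_total / MEMBERS   # about 6.24 (reported as 6.23)
savings_per_member = year1_savings / MEMBERS     # about 0.57

# Reported baseline minus rounded savings gives the reported Year 1 figure.
post_per_member = 6.23 - round(savings_per_member, 2)

print(f"{baseline_per_member:.2f} {savings_per_member:.2f} {post_per_member:.2f}")
```

The one-cent gap between 6.24 and the reported 6.23 is consistent with the component totals having been rounded to two decimals before publication.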
Liu, R.; Jong, C.; Li, H.; Cao, Y.; Yao, Q.; Yamana, T.; Pei, S.; Du, H.
Effective pandemic response requires accurate modeling of population compliance with non-pharmaceutical interventions (NPIs), yet most epidemic models treat behavioral change as fixed scenarios rather than an emergent process. Here, we test whether large language model (LLM)-based agents can generate individualized behavioral responses to time-varying NPIs and disease risk. We instantiate demographically representative agents in three U.S. cities (Boston, Denver, San Antonio) and condition them on evolving outbreak conditions and policies during the early COVID-19 pandemic, without fitting to observed mobility data. Across three frontier LLMs and their ensemble, agents generate zero-shot mobility changes across restaurants, retail, and entertainment venues, benchmarked against cellphone-derived foot-traffic records. The simulations recover average mobility trends across cities and venue types but exhibit overly narrow within-city variation. The three LLMs display distinct biases, while an ensemble approach improves robustness and overall performance. These findings establish LLM agents as a promising framework for modeling adherence to NPIs and highlight the need for further fine-tuning and empirical validation before they can support policy analysis.
Lovecchio, G.; Solnes Miltenburg, A.; Kiritta, R.; Kihunrwa, A.; Staff, A. C.; Chola, L.
Pre-eclampsia (PE) is a major contributor to maternal and neonatal morbidity and mortality. Research in high-income countries has shown that biomarker-based PE screening can improve timely detection and management of women at high risk of PE. We conducted a pre-trial cost-effectiveness analysis of introducing biomarker screening in Tanzania to identify key parameters informing a full, trial-based economic evaluation. We developed a decision tree comparing current practice with two biomarker-based screening strategies: Strategy 1 introducing placental growth factor (PlGF), and Strategy 2 adding soluble fms-like tyrosine kinase-1 (sFlt-1)/PlGF ratio. Under both strategies we assumed early aspirin prophylaxis and/or close monitoring for high-risk women. For each of the three options, we modelled PE diagnosis, as well as maternal and neonatal outcomes over a one-year time horizon, assuming a healthcare sector perspective. We quantified health outcomes in terms of disability-adjusted life years (DALYs) and costs in 2023 US$. When compared to current practice, the incremental cost per DALY averted was $410.45 for Strategy 1 and $1,011.78 for Strategy 2. Limiting the novel strategies to nulliparous women decreased incremental cost-effectiveness ratios to $184.15 (Strategy 1) and $413.33 (Strategy 2). Key parameters impacting cost-effectiveness were PE prevalence, biomarker screening accuracy, adherence to and effectiveness of preventive treatment and monitoring, and related costs. Based on our findings, biomarker screening has the potential to be cost-effective in Tanzania, particularly if introduced early in pregnancy and targeted at nulliparous women. Further research in low-resource settings is needed to overcome the current data and evidence gaps.
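For readers unfamiliar with the headline metric, the incremental cost per DALY averted is the extra cost of a strategy divided by the DALYs it avoids relative to current practice. The sketch below shows that calculation; the function name and the input values are hypothetical and are not taken from the study's decision tree.

```python
def cost_per_daly_averted(cost_new, cost_current, dalys_new, dalys_current):
    """Incremental cost-effectiveness ratio in cost per DALY averted."""
    incremental_cost = cost_new - cost_current
    dalys_averted = dalys_current - dalys_new
    if dalys_averted <= 0:
        # The ratio is not meaningful when the new strategy averts no DALYs.
        raise ValueError("new strategy averts no DALYs")
    return incremental_cost / dalys_averted


# Hypothetical cohort totals (US$ and DALYs), for illustration only.
ratio = cost_per_daly_averted(
    cost_new=120_000, cost_current=100_000,
    dalys_new=450, dalys_current=500,
)
print(ratio)
```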
Mandaliya, P.; Barasa, E.; Aywak, D.; Okalebo, F.
Breast cancer was the leading cause of cancer-related mortality among women worldwide in 2022. In Kenya, more than a quarter of breast cancer patients have the aggressive Human Epidermal Growth Factor Receptor 2 positive subtype. Trastuzumab is recommended for its treatment, but high costs have limited access. This study evaluated the cost-effectiveness and affordability of trastuzumab-based regimens to inform their adoption and use in Kenya. A cost-utility analysis was conducted from the healthcare payer perspective over a lifetime horizon. Five trastuzumab-based regimens of varying durations (9-week, 6-month, 9-month, 12-month, and 24-month) were compared with chemotherapy alone. Direct medical costs were estimated using a bottom-up micro-ingredient approach. All costs were reported in 2022 USD. A cohort Markov state-transition model with a monthly cycle length was used to estimate the costs and outcomes for an open hypothetical cohort. Scenario, deterministic sensitivity and probabilistic sensitivity analyses were conducted. A budget impact analysis estimated the financial implications of each regimen. The 9-week regimen had the lowest incremental cost-effectiveness ratio (ICER) of USD 3,230 per QALY, while the remaining regimens had ICERs ranging from USD 4,046 to 9,846 per QALY. The findings were most sensitive to the price and quantity utilized per cycle of trastuzumab. A reimbursement cap of KES 40,000 per cycle reduced ICERs by up to 61%. Over five years, the 9-week regimen would account for 1.2% of the projected insurer's budget, whereas the current recommended 12-month regimen would consume 2.82%. Although none of the regimens were cost-effective at Kenya's WTP threshold (USD 1,054.80), the 9-week regimen may still be considered by policymakers given its greater affordability. Further cost reductions can be achieved through negotiating lower drug prices, improving access to biosimilars, and implementing vial sharing.
Conde, F.
Background: Health-related social needs (HRSNs), particularly housing instability, are significant drivers of poor health outcomes among Medicaid populations. New York State's Social Care Networks (SCNs) aim to systematically connect members to housing services through coordinated referral systems. However, limited systematic analysis of referral patterns hinders quality improvement efforts. We analyzed housing referral outcomes and workflows to identify barriers to successful service connections. Methods: We conducted a mixed-methods quality improvement study at Public Health Solutions' WholeYouNYC SCN Coordination Center. Quantitative analysis examined 4,258 housing referrals submitted between June 2025 and January 2026, extracted from the Unite Us platform via Power BI dashboard. We calculated acceptance rates, analyzed time metrics, and examined outcomes by receiving organization. Qualitative data were collected through structured consultations with 7 staff members (5 navigators, 2 supervisors) and review of internal workflow documentation. Process mapping identified workflow bottlenecks. Results: Of 4,258 housing referrals, only 45% (n=1,936) were accepted by receiving organizations, while 19% (n=815) were rejected and 32% (n=1,382) remained awaiting response with no recorded action. Average time to acceptance was 8 days for accepted referrals. Acceptance rates were consistent across top receiving organizations (44-46%), suggesting systemic rather than partner-specific barriers. Analysis of unresolved referrals revealed prolonged cases, with the longest pending 271 days. Three critical workflow bottlenecks were identified: CBO response delays, missing housing documentation, and challenges with client engagement. Conclusions: Low housing connection rates (45%) and prolonged unresolved referrals (up to 271 days) indicate systemic barriers requiring interventions at multiple levels. 
Recommendations include establishing CBO response time benchmarks, implementing automated follow-up protocols, standardizing documentation requirements, and enhancing real-time data monitoring. These findings provide an evidence-based framework for quality improvement in social care coordination programs.
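The reported percentages follow directly from the referral counts. The sketch below recomputes them from the numbers in the abstract; the three listed outcomes do not sum to the total, and the abstract does not describe the status of the remaining referrals.

```python
TOTAL_REFERRALS = 4258

# Outcome counts as reported in the abstract.
outcomes = {"accepted": 1936, "rejected": 815, "awaiting response": 1382}

# Share of all housing referrals in each reported outcome.
shares = {name: count / TOTAL_REFERRALS for name, count in outcomes.items()}

# Referrals not covered by the three reported categories.
unaccounted = TOTAL_REFERRALS - sum(outcomes.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"not described in the abstract: {unaccounted}")
```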
Chevalier, J. M.; Stellbrink, L. M.; Steijvers, L.; Wijnen, S.; van Daalen, F.; Kojan, L.; Li, N.; Jahn, B.; Siebert, U.; Calero Valdez, A.; Hiligsmann, M.; Crutzen, R.; Dukers-Muijrers, N. H.; Kretzschmar, M. E.
Individuals adapt their behavior in response to infectious disease epidemics. Understanding the determinants of behavior, particularly the impact of infections themselves, can help model the feedback loop between disease and behavior in epidemic models. We combined the Imperial College London YouGov COVID-19 behavior survey with hospitalization data and the Oxford COVID-19 government response tracker stringency index to identify the key predictors of three health behaviors--social distancing, masking, and personal protective measures (e.g., handwashing)--during an early phase of the COVID-19 pandemic in six different countries. We compared two machine learning algorithms--logistic regression with stepwise Akaike Information Criterion and extreme gradient boosting (XGBoost). Top predictors of health behavior across the six countries were perceived disease severity, hospitalizations, willingness to isolate, and intervention effectiveness. Logistic regression and XGBoost had comparable performance. Machine learning algorithms trained on real-world data could be used to predict individual behavior uptake in agent-based network models.
Sayed, A. M.; Huan, P. T.; Nguyen, T. K.; Fathy, E.; Aziz, T.; Tho, D. V.; Huy, N. T.
Background: Incomplete dissemination of clinical trial results remains an important challenge for research transparency and evidence synthesis. Although prior studies have quantified the overall extent of non-dissemination, less is known about whether trial characteristics observable at registration are associated with subsequent dissemination within sponsor portfolios. Methods and findings: We conducted a retrospective cohort study of 17,537 completed interventional clinical trials registered on ClinicalTrials.gov between 2007 and 2024 across the 20 largest global pharmaceutical companies. We developed the Operational Complexity Index (OCI), a composite measure derived from planned enrollment, facility count, and geographic scope, and examined its association with trial dissemination using multivariable logistic regression and time-to-event analyses. Higher OCI was associated with greater odds of dissemination (adjusted odds ratio [aOR] = 2.40, 95% CI 2.23-2.60; p < 0.001), with dissemination increasing from 47% in the lowest OCI decile to 95% in the highest. Higher operational complexity was also associated with earlier dissemination; over a 1,095-day horizon, high-OCI trials were disseminated a mean of 310.88 days earlier than low-OCI trials (RMST difference, 310.88 days; 95% CI 300.59-320.96; p < 0.001). This pattern was observed across sponsors, clinical phases, and therapeutic areas. In predictive analyses using registration-time variables, the structural model achieved a cross-validated AUC of 0.816 and a holdout AUC of 0.814, whereas the full model, including sponsor identity, achieved a cross-validated AUC of 0.858 and a holdout AUC of 0.857. Using benchmark phase-based costing assumptions, the 5,019 non-disseminated trials corresponded to an estimated US$10.94-15.26 billion in sunk research investment.
Conclusions: Among trials conducted by the 20 largest pharmaceutical sponsors, greater operational complexity at registration was associated with a higher likelihood of dissemination and earlier dissemination. These findings suggest that aggregate sponsor-level transparency metrics may mask important heterogeneity within sponsor portfolios. Future work should assess whether registration-time trial characteristics can help identify trial subgroups at higher risk of non-dissemination.
Author summary
Why was this study done? Incomplete dissemination of clinical trial results reduces the completeness of the medical evidence base and the public value of research participation. Previous studies have described overall rates of trial non-dissemination, but less is known about whether dissemination varies systematically across different types of trials within sponsor portfolios. We examined whether trial characteristics available at registration were associated with later dissemination of results among large pharmaceutical sponsors.
What did the researchers do and find? We analyzed 17,537 completed interventional clinical trials sponsored by the 20 largest pharmaceutical companies and registered on ClinicalTrials.gov between 2007 and 2024. We developed an Operational Complexity Index (OCI) based on planned enrollment, number of facilities, and geographic scope to measure trial operational scale at registration. Higher OCI was associated with a greater likelihood of dissemination and earlier dissemination. Dissemination ranged from 47% in the lowest OCI decile to 95% in the highest. This pattern was observed across sponsor portfolios, clinical phases, and therapeutic areas, with an average within-sponsor dissemination gap of 40 percentage points between lower- and higher-complexity trials. In manual validation of 344 sampled trials, the automated dissemination-classification pipeline achieved 92.1% accuracy. Using benchmark phase-based costing assumptions, the 5,019 non-disseminated trials corresponded to an estimated US$10.9-15.3 billion in sunk research investment.
What do these findings mean? Dissemination was not uniform across trial types within sponsor portfolios; trials with lower operational complexity were less likely to be disseminated than trials with higher operational complexity. Aggregate sponsor-level transparency measures may therefore miss important differences within portfolios. Registration-time trial characteristics showed predictive signal for non-dissemination, but whether such information could support monitoring strategies would require prospective validation. More complete dissemination of trial results would strengthen the scientific record and improve the public value of clinical research.
Kleper, S. L.; Melamed, R. D.
Machine learning models for causal inference aim to adjust for confounding factors that are associated with both an exposure and an outcome, creating a spurious, biased association. However, these methods are rarely empirically evaluated to assess their success in mitigating such bias. Recent advances in knowledge representation, including both foundation models and knowledge graphs, could enrich these models, but rigorous evaluations are needed in order to assess their potential. Here, we ask whether enriching existing causal inference models with knowledge representations from foundation models can improve confounding control. Rather than using semi-simulated data to address this question, we focus on examples of real confounding: we emulate target randomized active comparator trials that are subject to confounding by indication. Our results can guide researchers aiming to develop or apply methods for discovering causal effects from observational data.
Golshani, P.; Joseph, M. S.
Objective: To characterize the magnitude and geographic distribution of commercially negotiated hospital facility rates for fourteen common interventional radiology (IR) procedures using publicly posted Hospital Price Transparency Machine-Readable Files (MRFs), and to describe the relationships between state-level commercial pricing, population rurality, and within-system rate uniformity. Methods: In this cross-sectional observational analysis, we examined hospital-weighted commercial rate observations from U.S. hospital MRFs for fourteen IR procedures spanning image-guided drainage, embolization, peripheral vascular intervention, dialysis access maintenance, and percutaneous spine. The unit of analysis was one observation per distinct negotiated rate per state-CPT cell, deduplicating multi-facility same-system reporting in which two or more hospitals posted identical rate, range, and payer-count tuples. Outliers were excluded using transparent absolute and CMS-relative bounds. State-level statistics were computed where ≥5 distinct hospital-system observations were reported. Commercial rates were compared to CY 2026 CMS Outpatient Prospective Payment System (OPPS) facility payments. Relationships between state-level commercial rate and 2020 U.S. Census percent-rural population were assessed by Spearman rank correlation. Results: Across 14 procedures, state-level commercial median rates varied 3.7- to 8.3-fold between the highest- and lowest-priced states. The largest spreads were observed for fem-pop angioplasty (CPT 37224, 8.3-fold), fem-pop atherectomy (37225, 8.1-fold), and iliac stenting (37221, 7.1-fold). National median commercial rates ranged from 1.34x (PAE/GAE) to 3.60x (paracentesis) the corresponding CMS OPPS facility payment.
Across all 14 procedures, the relationship between state percent-rural and median commercial rate was negative (mean Spearman ρ = -0.46, range -0.33 to -0.80; 14 of 14 codes negative), with the most-rural quartile of states showing a median commercial rate 42% below the most-urban quartile. Deduplication identified 660 multi-facility groups in which a single negotiated rate was applied across two or more affiliated hospitals within a state. Discussion: Substantial state-level variation in commercially negotiated facility rates exists for common IR procedures, with consistently lower rates in more rural states. Within-system rate uniformity is a frequent feature: many regional health systems post identical commercial rates across multiple owned facilities. The findings are consistent with prior literature linking commercial pricing to market structure and support continued investment in price transparency as a precondition for informed decision-making.
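The rurality analysis rests on Spearman rank correlation, which is the Pearson correlation computed on ranks. A dependency-free sketch of that statistic follows; the state-level values are hypothetical and the helper names are illustrative, not taken from the study's code.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical states: percent rural and median commercial rate (USD).
pct_rural = [5, 12, 20, 35, 60]
median_rate = [9000, 9500, 7000, 6000, 4000]
rho = spearman(pct_rural, median_rate)
print(round(rho, 2))
```

A negative value, as in this made-up example, corresponds to the pattern the abstract reports: more rural states tend to have lower median commercial rates.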
Belski, V.; Lukina, K.
Emergency department (ED) triage assigns patients a five-level Emergency Severity Index (ESI) score that determines care priority. We investigate the feasibility of automating this process, comparing large commercial models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, MedGemma) against a purpose-built pipeline combining a small extraction model with a deterministic clinical engine, and a 9B-parameter language model trained with structured chain-of-thought supervision and reinforcement learning. Off-the-shelf large models achieve only 45-55% exact ESI accuracy while being impractical for clinical deployment due to privacy constraints, cost, and latency. Our specialized BiomedBERT [4] pipeline achieves 88.9% exact accuracy with 97.2% adjacent accuracy (±1 ESI) on a 50-case expert-labeled evaluation set, approaching nurse inter-rater agreement. A Qwen3.5-9B model [16] fine-tuned with chain-of-thought supervision achieves 75.0% exact / 97.2% adjacent accuracy on a 36-case narrative evaluation. Ongoing GRPO training [13] with a clinically asymmetric reward function and 2,776 ESI-1 narrative training cases (previously 22, due to a discovered extraction bug) shows strong early reward signal. We document 37+ BERT experiments, multiple LLM training cycles, systematic data quality audits, and the specific engineering decisions that enabled progress, including the discovery that 71% of training labels for altered mental status were false positives.
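The two metrics quoted above differ only in tolerance: exact accuracy requires the predicted ESI level to match the reference, while adjacent accuracy also accepts predictions within one level. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def triage_accuracy(predicted, reference):
    """Return (exact, adjacent) accuracy for ESI levels 1-5.

    exact:    share of cases where the prediction equals the reference.
    adjacent: share of cases where the prediction is within +/-1 level.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    exact = sum(p == r for p, r in zip(predicted, reference)) / n
    adjacent = sum(abs(p - r) <= 1 for p, r in zip(predicted, reference)) / n
    return exact, adjacent


# Hypothetical expert labels and model outputs for six cases.
reference = [1, 2, 3, 3, 4, 5]
predicted = [1, 2, 3, 2, 4, 3]
exact, adjacent = triage_accuracy(predicted, reference)
print(f"exact={exact:.1%} adjacent={adjacent:.1%}")
```

Adjacent accuracy treats over-triage and under-triage by one level identically; the clinically asymmetric reward mentioned in the abstract suggests the authors weight those errors differently, which this sketch does not attempt to model.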
Bowen, H. P.; O'Loughlin, G.; Schleicher, C.; Schulthess, D.
Background: The impact of the Inflation Reduction Act (IRA) upon late-stage developments has been assumed to be limited. The Congressional Budget Office's IRA analysis excluded post-approval innovation, potentially overlooking substantial economic risks to drug developers and declines in the availability of treatments in areas of high unmet medical need such as oncology. Methods: A total of 1148 secondary trials from 364 FDA-approved medicines, published from 2018 to 2025, were obtained from Biomedtracker and clinicaltrials.gov. Using fractional multinomial logit, we model the share distribution of secondary indication studies across 19 disease groups and assess the change in this distribution post-IRA. We also assessed the number of secondary treatment studies pre- vs. post-IRA using multiple linear regression. Results: After the IRA's introduction, small molecule follow-on studies in oncology exhibited a statistically significant 35% decline (R2 = .48, p < 0.014) and lead indication small molecule oncology approvals exhibited a statistically significant 27% decline (R2 = .70, p < 0.002). We also find a statistically significant 14% decline in the share of orphan oncology studies pre- vs. post-IRA (p<0.001). Research Conclusions: This study's results refute claims that the IRA would have minimal negative effects on patient access or late-stage biopharmaceutical R&D. We hope this study reinvigorates debate about the law's unintended consequences and encourages thoughtful policy solutions, as the IRA manifestly creates disincentives that negatively impact patients seeking needed new medicines, particularly those requiring cures addressing metastatic late-stage cancers.
Adibi, A.; Le, K. X.; Pierson, E.; Diao, J. A.; Esfandiari, N.; Carlsten, C.; Sadatsafavi, M.
Importance: Several professional medical societies have removed race and ethnicity from widely used clinical algorithms with implications for millions of patients. Yet the opinions of patients and the public regarding the tensions underlying these pivotal changes have not been systematically explored. Objective: To assess global public opinion on the use of race or ethnicity in clinical algorithms, including preferences for different approaches to algorithmic reform and perceptions of alternative predictors. Design: Cross-sectional survey study. Setting: Multinational opt-in online survey conducted via Prolific in January 2026. Participants: A volunteer convenience sample with quota sampling to achieve approximately equal participation by sex at birth and across ten categories of self-identified race and ethnicity. Main Outcomes and Measures: Self-reported comfort with demographic and social predictors in clinical calculators, with net comfort defined as percentage extremely or somewhat comfortable minus percentage extremely or somewhat uncomfortable; preferences for race-specific versus race-free algorithms; perceptions of algorithmic harm or benefit. Results: Of 1,050 responses, 994 (94.7%) met eligibility criteria. Participants resided in 43 countries with a median age of 32.0 years (IQR, 26-41). Net comfort with the use of race or ethnicity in a hypothetical cancer risk calculator was +62.4% (95% CI: +57.8% to +66.9%), compared with +14.5% (95% CI: +9.1% to +19.9%) for postal or ZIP code. Overall, 87.9% (95% CI: 85.9% to 90.0%) were comfortable with race or ethnicity if a clinician explained its use and only 12.8% agreed race and ethnicity should never be used clinically. Across spirometry, kidney function, and cardiovascular risk calculators, 40.0% to 47.6% preferred race-specific versions, whereas 16.7% to 28.2% preferred race-free alternatives. 
Furthermore, a substantial proportion disagreed that they were well-represented by race and ethnicity categories, ranging from 22.1% for osteoporotic fracture risk equations to 42.9% for cardiovascular risk equations. These findings were consistent across countries, self-identified race and ethnicity, and among participants reporting prior experiences of racism in healthcare. Conclusions and Relevance: In our diverse multinational survey study, respondents were comfortable with the use of race and ethnicity across application areas, but often did not feel represented by existing categories and were less comfortable with the use of alternatives based on postal or ZIP codes.
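Net comfort, as defined in the abstract, is the percentage of respondents who are extremely or somewhat comfortable minus the percentage who are extremely or somewhat uncomfortable. The sketch below computes it over hypothetical responses; the response strings and the choice to keep neutral answers in the denominator are assumptions, since the abstract does not specify either.

```python
COMFORTABLE = {"extremely comfortable", "somewhat comfortable"}
UNCOMFORTABLE = {"extremely uncomfortable", "somewhat uncomfortable"}


def net_comfort(responses):
    """Net comfort in percentage points over all responses (assumed denominator)."""
    if not responses:
        raise ValueError("no responses")
    positive = sum(r in COMFORTABLE for r in responses)
    negative = sum(r in UNCOMFORTABLE for r in responses)
    return 100.0 * (positive - negative) / len(responses)


# Hypothetical sample: 6 comfortable, 2 neutral, 2 uncomfortable.
sample = (
    ["extremely comfortable"] * 2
    + ["somewhat comfortable"] * 4
    + ["neutral"] * 2
    + ["somewhat uncomfortable"]
    + ["extremely uncomfortable"]
)
score = net_comfort(sample)
print(score)
```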
Tuttle, M.; Maas, C. C. H. M.; An, J.; Wessler, B. S.; Harvey, W. F.; Selker, H. P.; van Klaveren, D.; Kent, D. M.
The Epic Sepsis Model version 2 (ESMv2) is a prediction model embedded in the electronic medical record that warns clinicians which hospitalized patients are at risk for sepsis. We conducted a retrospective cohort study of 31,951 hospitalizations of 25,760 patients to compare analyses conducted at the commonly used patient level (where the maximum prediction prior to the onset of sepsis is used to measure performance) vs a novel prediction level (where each prediction is used to measure performance). Sepsis, defined by the Sepsis-3 criteria, occurred during 1,049 hospitalizations (3.3%). Patient-level analyses suggested excellent discrimination (AUC 0.86; IQR 0.85, 0.87), whereas prediction-level analyses demonstrated lower performance (AUC 0.62; IQR 0.57, 0.65). Low estimates of the positive predictive value (14.5% at the patient level vs 4% at the prediction level) imply a high number of false alerts. Common evaluation approaches may overstate the performance of dynamic prediction models and mislead clinical decision-making.
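The contrast between the two evaluation levels can be sketched on toy data. This is an illustration only, not the authors' analysis: the `stays` data are invented, and labelling every prediction with its hospitalization's outcome is a simplifying assumption (the abstract does not say how prediction-level labels were assigned).

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney probability that a positive outranks a negative.

    Ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical hospitalizations: (risk scores over time, sepsis outcome).
stays = {
    "A": ([0.1, 0.2, 0.9], 1),
    "B": ([0.3, 0.2, 0.4], 0),
    "C": ([0.1, 0.5, 0.1], 0),
}

# Patient level: one score per hospitalization, the maximum prediction.
patient_auc = auc(
    [max(scores) for scores, _ in stays.values()],
    [outcome for _, outcome in stays.values()],
)

# Prediction level: every prediction is scored on its own
# (assumption: each inherits the hospitalization's outcome).
pred_scores = [s for scores, _ in stays.values() for s in scores]
pred_labels = [outcome for scores, outcome in stays.values() for _ in scores]
prediction_auc = auc(pred_scores, pred_labels)

print(patient_auc)     # 1.0
print(prediction_auc)  # 9.5 / 18, about 0.53
```

Taking the maximum keeps only the single highest alert for each stay, so the many low scores that a septic patient also received drop out of the patient-level estimate.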
Wittlinger, S.; Meerjansen, J.; Wolf, F.; Wiest, I. C.; Ebert, M. P.; Siegel, F.; Belle, S.
Objective: Structured extraction from clinical free-text depends on human annotators whose labels are susceptible to errors and knowledge-driven mistakes; exhaustive quality control is impractical at scale. We evaluate whether disagreement among multiple locally hosted large language models (LLMs) can prioritize human annotations for targeted review. Methods: Multiple LLMs independently extract the same set of structured variables annotated by a human reviewer. For each annotation, an agreement score counts the LLMs matching the human label. Using four locally hosted LLMs (Gemma 3 27B, DeepSeek-R1 70B, GPT-OSS 120B, Mistral Large 3), we evaluated this approach on 910 German-language colonoscopy reports describing endoscopic mucosal resection, with five structured variables per case (anatomical location, two diameters, resection technique, multiple polyps), yielding 4,550 annotations and a 377-case adjudication sample. A stratified sample oversampling low-agreement strata was adjudicated blinded by an experienced reviewer and analyzed with prevalence-adjusted estimates. Results: Human error rates rose as LLM agreement fell, from 0% at scores 3-4 to 76% at score 0. The lowest-agreement stratum was only 6.5% of annotations yet concentrated an estimated 80% of errors. The multi-LLM disagreement score achieved a prevalence-adjusted AUC-ROC of 0.991 (95% CI 0.987-0.994) and AUC-PR of 0.893 (95% CI 0.851-0.929) for error detection. Discussion: Multi-LLM disagreement outperformed single models and provided graded operating points for risk-stratified review. Conclusion: Multi-LLM disagreement provides a scalable quality-control signal for targeted review of the highest-yield cases. Because all models run locally, the framework is GDPR-compliant; its language- and task-agnostic design supports application across clinical domains.
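The agreement score described in the Methods (the number of LLMs whose extraction matches the human label) might be sketched as follows. The record layout, the exact-string matching rule, and the toy annotations are assumptions for illustration; a real pipeline would likely normalize values before comparing them.

```python
def agreement_score(human_label, llm_labels):
    """Count how many LLM extractions equal the human label (0..len(llm_labels))."""
    return sum(label == human_label for label in llm_labels)


# Hypothetical annotations, each with four LLM extractions.
annotations = [
    {"id": "case1/location", "human": "sigmoid",
     "llms": ["sigmoid", "sigmoid", "sigmoid", "rectum"]},
    {"id": "case1/diameter", "human": "12",
     "llms": ["21", "21", "21", "21"]},
    {"id": "case2/technique", "human": "piecemeal",
     "llms": ["piecemeal", "en bloc", "en bloc", "piecemeal"]},
]

# Review queue: lowest agreement first, since those labels are the most suspect.
review_queue = sorted(
    annotations,
    key=lambda a: agreement_score(a["human"], a["llms"]),
)

for a in review_queue:
    print(a["id"], agreement_score(a["human"], a["llms"]))
```

With four models the score takes values from 0 to 4, which gives the graded strata the abstract refers to; a reviewer could stop at any score threshold depending on available effort.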
Streicher, N. S.
Background and Objectives: Patient portals have become essential infrastructure for healthcare delivery following the 21st Century Cures Act, yet adoption remains inequitable. Understanding demographic and geographic determinants of portal activation is critical for addressing digital health disparities, particularly among neurology patients who face unique access barriers. We examined the demographic, geographic, and neighborhood-level factors associated with patient portal activation among neurology patients at multiple geographic scales in the Washington, DC metropolitan area. Methods: We conducted a retrospective cohort study of 72,417 adult neurology patients seen at two academic medical centers sharing an electronic health record in Washington, DC (February 2021-February 2026). We examined portal activation using multivariable logistic regression and geographic analysis at four nested scales: the metropolitan catchment area, DC's eight wards, individual census tracts (via geocoded patient addresses), and individual DC residents. Results: Portal activation was 64.7% overall. Activation varied by race/ethnicity (Non-Hispanic White 76.1%, Non-Hispanic Black 57.0%, Non-Hispanic Asian 57.6%, Hispanic 55.0%) and geography (DC Ward 2: 82.0% vs. Ward 7: 48.0%). Ward-level educational attainment (r = 0.948), broadband access (r = 0.889), and income (r = 0.811) were strongly correlated with activation. Within individual wards, Non-Hispanic White patients activated at 84-91% while Non-Hispanic Black patients activated at 48-64%, demonstrating that neighborhood resources alone do not explain disparities. Discussion: Patient portal activation is shaped by demographic, socioeconomic, and geographic factors operating at multiple levels. Persistent within-ward racial disparities indicate that geographically targeted interventions must be paired with culturally tailored approaches to achieve digital health equity.
Bressman, E.; Auerbach, A.; Keniston, A.; Jens, C.; Ranji, S.
Introduction: The use of artificial intelligence (AI) by clinicians has increased rapidly in recent years, with large language models (LLMs) emerging as tools that can equal clinician diagnostic performance in simulated settings. However, limited data exist regarding physicians' use of LLMs in real-world clinical practice. This study aimed to evaluate the frequency of LLM use among practicing hospitalists, identify which LLMs are most commonly utilized, and assess hospitalists' perceptions of the benefits and limitations of LLM use in clinical care. Methods: We conducted a cross-sectional survey study of academic hospital medicine faculty across 8 institutions within the Hospital Medicine Reengineering Network (HOMERuN), a collaborative research consortium. Eligible participants included hospitalists practicing within participating HOMERuN sites during the study period. The survey assessed the frequency of LLM use, types of LLMs used, clinical applications, and physician perceptions regarding usefulness, efficiency, and concerns associated with LLM adoption. Results: 170 respondents (67.1%) reported ever using an LLM in clinical practice. Among LLM users, OpenEvidence was the most used tool (88.9%), followed by ChatGPT (58.5%), Google Gemini (26.9%), and Microsoft Copilot (20.5%). Only a minority of hospitalists reported using LLMs daily while seeing patients. The most common use cases of LLMs were answering diagnostic (77.1%) and management (77.6%) questions. A majority also reported using LLMs to identify or summarize primary literature (60.0%). Lack of trust in outputs (49.8%), uncertainty around institutional policies (48.6%), and lack of access to secure applications (43.1%) were cited as the most frequent barriers to using LLMs in practice. Discussion: The use of LLMs in clinical practice is already widespread, though regular or daily use is not yet typical.
Concerns regarding reliability, patient privacy, and safe integration into clinical workflows remain significant barriers to broader adoption. The responsible implementation of LLMs in hospital medicine will require addressing these barriers.
Fitch, K. V.; Santaularia Gomez, N. J.; Tanveer, M.; Holmes, G. M.; Moracco, K. E.; Fliss, M. D.; Fulcher, N.; Ranapurwala, S. I.
Introduction: Although state minimum wage (MW) is a policy lever that affects income and poverty and can prevent violence, no prior study has comprehensively evaluated its impact in the United States (US). In this study, we estimated the impact of at least a $1 USD increase in state MW above the federal MW on overall, firearm, and non-firearm homicide mortality and examined its impact on racialized inequities. Methods: We conducted a quasi-experimental study using controlled interrupted time series (CITS) and synthetic controlled interrupted time series (SCITS) approaches to examine the immediate and sustained impact of state MW increases. We used state-month level homicide victimization mortality data from 2010-2019. Homicide victimization deaths were identified using International Classification of Diseases, 10th Revision codes. State MW data were obtained from the Bureau of Labor Statistics. Results: Demographic and social variables from intervention, never-exposed, and always-exposed states were similar to each other and representative of the total US population from all 50 states. The CITS results show that after MW increases, the intervention states experienced a sustained decline of -0.22 (-0.37, -0.07) homicide victimizations/100,000 person-years/year relative to the never-exposed states and -0.39 (-0.59, -0.18) relative to always-exposed states. This resulted in 5,657 fewer homicide victimization deaths in the intervention states over the four-year post-MW-increase period compared to the never-exposed states. SCITS results were similar to the CITS results, and the majority of sustained declines were observed in firearm-related deaths and among the Black population. Conclusion: MW increases were associated with a reduction in homicide victimization rates that was robust in multiple sensitivity analyses and more pronounced for firearm-related homicide deaths, and with reduced homicide victimization inequities for Black Americans.
Irlmeier, R.; Jin, Z.; Ye, F.
Background: Simon two-stage designs for binary endpoints and their time-to-event analogues, including the Kwak and Jung method, rely on a fixed null benchmark. Their Type I error control is valid only when that benchmark is correctly specified. In practice, historical benchmarks are often inconsistent due to small samples, population heterogeneity, changing eligibility criteria, and evolving standards of care. Even modest misspecifications can substantially inflate the Type I error rate, leading to costly advancement of ineffective treatments. Methods: We propose the Interval-Null Robust (INR) two-stage design framework that accounts for uncertainty in the historical null benchmark. We define the null hypothesis as a plausible range of clinically uninteresting values: $p \in [p_{0L}, p_{0U}]$ for binary endpoints and $\lambda \in [\lambda_{0L}, \lambda_{0U}]$ (or equivalent survival probabilities) for time-to-event endpoints. Type I error is controlled uniformly over the full null interval: $\sup_{\theta \in \Theta_0} \Pr_\theta(\mathrm{Go}) \le \alpha$. Under the monotonicity of the Go probability, the supremum occurs at the least favorable null configuration ($p_{0U}$ and $\lambda_{0L}$), but the design is not reduced to a point-null formulation. The interval defines the uncertainty set for error control and is used in selecting among feasible designs through robust criteria such as worst-case regret or minimal average expected sample size. Results: Across representative planning scenarios for both endpoint types, classic designs calibrated to a single benchmark exhibit substantial Type I error inflation when the true null parameter exceeds the assumed planning value. INR designs maintain the nominal Type I error rate across the full null interval, directly addressing this vulnerability to benchmark misspecification. The robustness-efficiency trade-off can be managed through design constraints and robust optimization criteria while preserving uniform Type I error control.
Conclusions: INR two-stage designs offer a transparent framework for addressing historical control uncertainty in single-arm Phase II trials. By replacing reliance on a fixed benchmark assumption with a more realistic interval of clinically plausible null values, the INR design reduces the risk of false-positive Go-decisions caused by benchmark misspecification. INR applies to both binary and time-to-event endpoints and is implemented in the open-source INRDesign R package and accompanying interactive Shiny app.
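For the binary-endpoint case, the idea of checking Type I error over a whole null interval rather than at a single benchmark can be sketched as below. This is not the INRDesign package or its API: the function names, the design parameters (r1, n1, r, n), and the interval [0.05, 0.15] are hypothetical values chosen for illustration, and the standard Simon decision rule is assumed (stop for futility if stage-1 responses are at most r1; declare Go if total responses exceed r).

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_sf(k, n, p):
    """P(X > k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 1.0
    return sum(binom_pmf(j, n, p) for j in range(k + 1, n + 1))


def go_probability(p, r1, n1, r, n):
    """Probability of a Go decision under true response rate p.

    Continue past stage 1 only if responses x1 > r1, then Go if x1 + x2 > r.
    """
    n2 = n - n1
    return sum(
        binom_pmf(x1, n1, p) * binom_sf(r - x1, n2, p)
        for x1 in range(r1 + 1, n1 + 1)
    )


def worst_case_type1(p0_low, p0_high, r1, n1, r, n, grid=101):
    """Largest Go probability over a grid on the null interval."""
    ps = [p0_low + (p0_high - p0_low) * i / (grid - 1) for i in range(grid)]
    return max(go_probability(p, r1, n1, r, n) for p in ps)


design = dict(r1=1, n1=10, r=5, n=29)  # illustrative design, not from the paper
at_low = go_probability(0.05, **design)
at_mid = go_probability(0.10, **design)
at_high = go_probability(0.15, **design)
worst = worst_case_type1(0.05, 0.15, **design)

print(at_low, at_mid, at_high, worst)
```

Because the Go probability increases with p, the grid maximum coincides with the value at the upper end of the interval, which is why a design calibrated only at the midpoint can exceed its nominal level when the true null rate is higher.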